This disclosure relates generally to techniques for processing electronic messages and documents. More particularly, the disclosure provides techniques to efficiently find similar and near-duplicate electronic messages and files.
Collaboration using electronic messaging, such as email and instant messaging, is becoming increasingly ubiquitous. Many users and organizations have transitioned to “paperless” offices, where information and documents are communicated almost exclusively using electronic messaging. In addition, paper-based documents can be scanned and converted to electronic files using optical character recognition (OCR). As a result, users and organizations are now expending time and money to sort and archive increasing volumes of digital documents and data.
At the same time, state and federal regulators such as the Federal Energy Regulatory Commission (FERC), the Securities and Exchange Commission (SEC), and the Food and Drug Administration (FDA) have become increasingly aggressive in enforcing regulations requiring storage, analysis, and reporting of information based on electronic messages. Additionally, criminal cases and civil litigation frequently employ electronic discovery techniques, in addition to traditional discovery methods, to discover information from electronic documents and messages.
One problem with electronically storing information is that complying with disclosure requirements or reporting requirements is difficult because of the large amounts of data that may accumulate. Because broadband connections to the Internet are common in most homes and businesses, emails frequently include one or more multi-megabyte attachments. Moreover, these emails and attachments are increasingly in diverse and proprietary formats, making later access to the data difficult without the required software.
Another problem is that disclosure requirements or reporting requirements do not simply require that the electronic message be preserved and then disclosed. Often, these requirements focus on disclosing or reporting information about the electronic message, such as who had access to sensitive data referred to in the contents of a particular electronic message. Some companies have teams of employees spending days and weeks reviewing emails in order to respond to regulatory audits and investigations. For these reasons, the inventors believe that users and organizations need electronic message analysis solutions to help lower the costs of disclosing and/or reporting information related to electronic messaging and other electronically stored information.
In electronic discovery, whether for early case assessment or for improving the speed and accuracy of review, it is critically important to identify as many responsive documents as possible. Unlike typical web search engine technologies, which focus on identifying only a handful of the most relevant documents, electronic discovery is invariably about minimizing the risk of overlooking relevant documents and minimizing expenses. This shifts the technical challenge from optimizing precision (finding only relevant documents) to increasing recall (finding most of the relevant documents). One aspect of message analysis is the ability to find similar or near-duplicate documents. Once similar or near-duplicate documents are found for a given document, a user can perform a desired action on that group of documents in bulk and thereby save time.
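The distinction between precision and recall drawn above can be illustrated with a short sketch. The following Python example is illustrative only and is not part of the disclosed techniques; the function name `precision_recall` and the document identifiers are hypothetical and chosen solely for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: eight documents are actually responsive.
relevant_docs = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# A search tuned for precision returns a handful of sure hits.
narrow_result = ["d1", "d2"]

# A search tuned for recall returns more documents, including some
# non-responsive ones (x1, x2, x3), but misses far fewer.
broad_result = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "x1", "x2", "x3"]

narrow_precision, narrow_recall = precision_recall(narrow_result, relevant_docs)
broad_precision, broad_recall = precision_recall(broad_result, relevant_docs)

print(f"narrow: precision={narrow_precision:.3f} recall={narrow_recall:.3f}")
print(f"broad:  precision={broad_precision:.3f} recall={broad_recall:.3f}")
```

In this made-up example, the narrow result contains only responsive documents but overlooks six of the eight, whereas the broad result includes three non-responsive documents but overlooks only one. The latter trade-off is the one that electronic discovery generally favors.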
Accordingly, what is desired is to solve problems relating to finding similar and near-duplicate documents, some of which may be discussed herein. Additionally, what is desired is to reduce drawbacks related to finding similar and near-duplicate documents, some of which may be discussed herein.